-
Notifications
You must be signed in to change notification settings - Fork 305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-16111 test: new utility daos_sys_logscan #15629
Conversation
Scan a list of engine logfiles to produce a nested dictionary of pools and a sequence of key events such as pool leadership terms, pool map version updates (due to target state changes), rebuild start and progress update events and total rebuild duration. This first version focuses on finding the pool service leader engine log file and producing this information. Future updates to the tool can potentially include finer-grain tracking of operations across all pool storage engine log files. The supporting class LogLine in cart_logparse.py has a tiny change to support this new utility. Signed-off-by: Kenneth Cain <[email protected]>
Ticket title is 'rebuild enhancement: uniform identifier in log messages' |
Test stage Build RPM on Leap 15.5 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/1/execution/node/327/log |
Test stage Build RPM on EL 9 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/1/execution/node/437/log |
Test stage Build RPM on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/1/execution/node/432/log |
Test stage Build DEB on Ubuntu 20.04 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/1/execution/node/431/log |
# rebuild aborted/errored | ||
# rebuild completed/success | ||
|
||
re_rank_assign = re.compile("ds_mgmt_drpc_set_rank.*set rank to (\d+)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
W605 invalid escape sequence
usually resolved by using a regex string
re_rank_assign = re.compile("ds_mgmt_drpc_set_rank.*set rank to (\d+)") | |
re_rank_assign = re.compile(r"ds_mgmt_drpc_set_rank.*set rank to (\d+)") |
rank = -1 | ||
for line in log_iter.new_iter(pid=pid): | ||
msg = line.get_msg() | ||
host = line.hostname | ||
datetime = line.time_stamp | ||
# Find engine rank assignment (early in log) | ||
match = self.re_rank_assign.match(msg) | ||
if match: | ||
rank = int(match.group(1)) | ||
print(f"========== rank {rank} logfile {fname} ==========") | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if rank == -1
at the end of this loop?
match = self.re_pmap_update.match(msg) | ||
if match: | ||
puuid = match.group(1) | ||
from_ver = int(match.group(2)) | ||
to_ver = int(match.group(3)) | ||
# ignore if this engine is not the leader | ||
if puuid not in self._pools or rank != self._cur_ldr_rank[puuid]: | ||
continue | ||
term = self._cur_term[puuid] | ||
self._pools[puuid][term]["maps"][to_ver] = {"carryover": False, "from_ver": from_ver, "time": datetime, "rebuild_gens": {}} | ||
#print(f"FOUND pool {puuid} map update {from_ver}->{to_ver} rank {rank}\t{host}\tPID {pid}\t{fname}") | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function has a lot of these of the format
match = ...
if match:
...
I think it would be easier to manage if each was a helper function. Then this function could either loop over each one, or call out to each function. It's generally not pythonic to have large functions.
There are many ways to do this, but some general examples might be (depending on what you need)
def fun1(arg):
print(arg)
def fun2(arg):
print(arg)
all_funs = (fun1, fun2)
for _fun in all_funs:
_fun('test')
def fun1(arg1, arg2):
print(arg1, arg2)
def fun2(arg1, arg2):
print(arg1, arg2)
all_funs_args = (
(fun1, ('arg1', 'arg2')),
(fun2, ('arg1', 'arg2')))
for _fun, args in all_funs_args:
_fun(*args)
continue | ||
term = self._cur_term[puuid] | ||
if term < 1: | ||
print(f"WARN pool {puuid} I don't know what term it is ({term})!") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on what tracks overall failure, might want to store warnings and errors somewhere instead of simply printing. Roughly e.g.
def _warn(self, message):
print(f"WARN: {message}")
self._warnings.append(message) # if you want to store the message
self._warnings +=1 # if you just want a count
And then later can do
if self._warnings:
print("Whoah something bad!")
sys.exit(1) # or return 1, etc.
Signed-off-by: Kenneth Cain <[email protected]>
Test stage Build RPM on Leap 15.5 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/2/execution/node/361/log |
Test stage Build RPM on EL 9 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/2/execution/node/368/log |
Test stage Build RPM on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/2/execution/node/367/log |
Test stage Build DEB on Ubuntu 20.04 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/2/execution/node/349/log |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small suggested fixes.
def _warn(self, wmsg, fname, line): | ||
full_msg = f"WARN file={fname}, line={line.lineno}: " + wmsg | ||
self._warnings.append(full_msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _warn(self, wmsg, fname, line): | |
full_msg = f"WARN file={fname}, line={line.lineno}: " + wmsg | |
self._warnings.append(full_msg) | |
def _warn(self, wmsg, fname, line=None): | |
full_msg = f"WARN file={fname}" | |
if line: | |
full_msg += f", line={line.lineno}" | |
full_msg += f": {wmsg}" | |
self._warnings.append(full_msg) |
# Find rank assignment log line for this file. Can't do much without it. | ||
self._file_to_rank[fname] = rank | ||
if rank == -1 and not self.find_rank(log_iter): | ||
self._warn(f"cannot find rank assignment in log file - skipping", fname, line) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self._warn(f"cannot find rank assignment in log file - skipping", fname, line) | |
self._warn("cannot find rank assignment in log file - skipping", fname) |
- fix up some flake8 python linting issues Signed-off-by: Kenneth Cain <[email protected]>
Test stage Build RPM on EL 9 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/3/execution/node/356/log |
Test stage Build RPM on Leap 15.5 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/3/execution/node/357/log |
Test stage Build RPM on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/3/execution/node/351/log |
Test stage Build DEB on Ubuntu 20.04 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/3/execution/node/345/log |
Signed-off-by: Kenneth Cain <[email protected]>
Test stage Build RPM on EL 9 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/4/execution/node/357/log |
Test stage Build RPM on Leap 15.5 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/4/execution/node/358/log |
Test stage Build RPM on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/4/execution/node/352/log |
Test stage Build DEB on Ubuntu 20.04 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/4/execution/node/338/log |
Signed-off-by: Kenneth Cain <[email protected]>
Test stage Build RPM on Leap 15.5 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/5/execution/node/363/log |
Test stage Build RPM on EL 9 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/5/execution/node/351/log |
Test stage Build RPM on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/5/execution/node/354/log |
Test stage Build DEB on Ubuntu 20.04 completed with status UNSTABLE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/5/execution/node/348/log |
Signed-off-by: Kenneth Cain <[email protected]>
Signed-off-by: Kenneth Cain <[email protected]>
Signed-off-by: Kenneth Cain <[email protected]>
Signed-off-by: Kenneth Cain <[email protected]>
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15629/11/execution/node/1491/log |
for some reason (SRE-related?) test_daos_management timed out, exceeded 200 seconds time, with 147+ seconds of that time being taken between "Step 1: setUp(): Starting servers" and "Step 2: setUp(): Starting agents [elapsed time since last step: 147.10s]" I will kick off this stage of testing again and see if we get lucky on the next try. |
def get_warnings(self): | ||
"""Return all warnings stored when scanning engine log files""" | ||
return self._warnings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could make this a property
def get_warnings(self): | |
"""Return all warnings stored when scanning engine log files""" | |
return self._warnings | |
@property | |
def warnings(self): | |
"""Return all warnings stored when scanning engine log files""" | |
return self._warnings |
And instead of
obj.get_warnings()
It would be like
obj.warnings # no parens
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in latest commit
def _set_rank(self, log_iter): | ||
print(f"INFO: searching for rank in file {log_iter.fname}") | ||
found = False | ||
for line in log_iter.new_iter(): | ||
# when a rank assignment log line found (engine start) | ||
match = self.re_rank_assign.match(line.get_msg()) | ||
if match: | ||
self._file_to_rank[log_iter.fname] = int(match.group(1)) | ||
found = True | ||
break | ||
|
||
# Future enhancement: what about log rotation (not an engine start scenario)? | ||
return found |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit - I'd recommend immediate return when possible
def _set_rank(self, log_iter): | |
print(f"INFO: searching for rank in file {log_iter.fname}") | |
found = False | |
for line in log_iter.new_iter(): | |
# when a rank assignment log line found (engine start) | |
match = self.re_rank_assign.match(line.get_msg()) | |
if match: | |
self._file_to_rank[log_iter.fname] = int(match.group(1)) | |
found = True | |
break | |
# Future enhancement: what about log rotation (not an engine start scenario)? | |
return found | |
def _set_rank(self, log_iter): | |
print(f"INFO: searching for rank in file {log_iter.fname}") | |
for line in log_iter.new_iter(): | |
# when a rank assignment log line found (engine start) | |
match = self.re_rank_assign.match(line.get_msg()) | |
if match: | |
self._file_to_rank[log_iter.fname] = int(match.group(1)) | |
return True | |
# Future enhancement: what about log rotation (not an engine start scenario)? | |
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in latest commit
match = self.re_step_up.match(msg) | ||
if match: | ||
puuid, term = self._get_ps_leader_components(match) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think better to reverse this if
and avoid large nesting
match = self.re_step_up.match(msg) | |
if match: | |
puuid, term = self._get_ps_leader_components(match) | |
match = self.re_step_up.match(msg) | |
if not match: | |
return False | |
puuid, term = self._get_ps_leader_components(match) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in latest commit
match = self.re_step_down.match(msg) | ||
if match: | ||
puuid, term = self._get_ps_leader_components(match) | ||
if term != self._cur_term[puuid]: | ||
self._warn(f"step_down term={term} != cur_term={self._cur_term}", fname, line) | ||
self._cur_ldr_rank[puuid] = -1 | ||
self._cur_ldr_pid[puuid] = -1 | ||
self._cur_term[puuid] = -1 | ||
self._pools[puuid][term]["end_time"] = datetime | ||
if self._debug: | ||
print(f"{datetime} FOUND pool {puuid} END\tterm {term} rank {rank}\t{host}\t" | ||
f"PID {pid}\t{fname}") | ||
return True | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommend reversing the if
- handling exit condition first
match = self.re_step_down.match(msg) | |
if match: | |
puuid, term = self._get_ps_leader_components(match) | |
if term != self._cur_term[puuid]: | |
self._warn(f"step_down term={term} != cur_term={self._cur_term}", fname, line) | |
self._cur_ldr_rank[puuid] = -1 | |
self._cur_ldr_pid[puuid] = -1 | |
self._cur_term[puuid] = -1 | |
self._pools[puuid][term]["end_time"] = datetime | |
if self._debug: | |
print(f"{datetime} FOUND pool {puuid} END\tterm {term} rank {rank}\t{host}\t" | |
f"PID {pid}\t{fname}") | |
return True | |
return False | |
match = self.re_step_down.match(msg) | |
if not match: | |
return False | |
puuid, term = self._get_ps_leader_components(match) | |
if term != self._cur_term[puuid]: | |
self._warn(f"step_down term={term} != cur_term={self._cur_term}", fname, line) | |
self._cur_ldr_rank[puuid] = -1 | |
self._cur_ldr_pid[puuid] = -1 | |
self._cur_term[puuid] = -1 | |
self._pools[puuid][term]["end_time"] = datetime | |
if self._debug: | |
print(f"{datetime} FOUND pool {puuid} END\tterm {term} rank {rank}\t{host}\t" | |
f"PID {pid}\t{fname}") | |
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in latest commit
match = self.re_pmap_update.match(msg) | ||
if match: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recommend reversing if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in latest commit
match = self.re_rebuild_ldr_start.match(msg) | ||
if match: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recommend reversing if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in latest commit
match = self.re_old_ldr_start.match(msg) | ||
if match: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recommend reversing if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in latest commit
match = self.re_rebuild_ldr_status.match(msg) | ||
if match: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recommend reversing if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in latest commit
match = self.re_old_ldr_status.match(msg) | ||
if match: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recommend reversing if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in latest commit
Signed-off-by: Kenneth Cain <[email protected]>
Required-githooks: true Signed-off-by: Kenneth Cain <[email protected]>
@@ -0,0 +1,567 @@ | |||
#!/usr/bin/env python3 | |||
# | |||
# (C) Copyright 2025 Intel Corporation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New copyright should be HPE
# (C) Copyright 2025 Intel Corporation | |
# (C) Copyright 2025 Hewlett Packard Enterprise Development LP |
@@ -1,5 +1,5 @@ | |||
# /* | |||
# * (C) Copyright 2016-2023 Intel Corporation. | |||
# * (C) Copyright 2016-2025 Intel Corporation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be HPE (and keep the old intel). You'll probably need to commit with --no-verify
so the githooks don't update again.
Wil be fixed after #15678
# * (C) Copyright 2016-2025 Intel Corporation. | |
# * (C) Copyright 2016-2024 Intel Corporation. | |
# * (C) Copyright 2025 Hewlett Packard Enterprise Development LP |
Required-githooks: true Signed-off-by: Kenneth Cain <[email protected]>
@daltonbohning got copyright out of date error here https://github.com/daos-stack/daos/actions/runs/12603658485/job/35129108379?pr=15629 - anything I should change in the PR to resolve that? |
Since #15678 is merged you can merge in latest master and the check should work now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nice work which will be useful for support tasks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except one last copyright
@@ -1,5 +1,6 @@ | |||
# /* | |||
# * (C) Copyright 2016-2023 Intel Corporation. | |||
# * (C) Copyright 2016-2024 Intel Corporation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intel copyright should not be updated. You'll have to manually revert to 2016-2023
Signed-off-by: Kenneth Cain <[email protected]>
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15629/18/testReport/ |
Looks like this may be a new instance of Go runtime needing a suppression rule https://daosio.atlassian.net/browse/DAOS-16925 . Currently, Jenkins Build 18 is continuing on to other stages following Unit Tests (where NLT is run). |
Scan a list of engine logfiles to produce a nested dictionary of pools and a sequence of key events such as pool leadership terms, pool map version updates (due to target state changes), rebuild start and progress update events and total rebuild duration. This first version focuses on finding the pool service leader engine log file and producing this information. Future updates to the tool can potentially include finer-grain tracking of operations across all pool storage engine log files.
The supporting class LogLine in cart_logparse.py has a tiny change to support this new utility.
Before requesting gatekeeper:
Features:
(orTest-tag*
) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.Gatekeeper: